title: "Peer Assessment I"
output:
  html_document:
    pandoc_args: [
      "--number-sections",
    ]

First, let us load the data and necessary packages:

load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(devtools)
library(statsr)
memory.limit(size = 40000)  # Windows-only; raises the memory ceiling for this session
## [1] 40000

1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

ames_train %>%
  mutate(House.Age = 2021-Year.Built) %>%
  ggplot(aes(x=House.Age)) + 
  geom_histogram(color = 'orange', fill = 'blue', bins = 30) + 
  ggtitle('Number of Houses by Age') + 
  labs(x = 'Age of House', y = 'Count') +
  theme(plot.title = element_text(hjust = 0.5))

* * *

From the histogram above, you can see that the distribution of house ages in the ames_train data set is right-skewed. Most houses are under 75 years old, and the frequency drops off sharply beyond that. Ages range from roughly 10 years to about 135 years, and the modal age is around 15 years (houses built around 2006).
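The shape described above can also be confirmed numerically; under right skew the mean age exceeds the median, so a quick summary (using the same 2021 reference year as the histogram) is a useful check:

```r
# Numeric summary of house age (mean > median indicates right skew)
ames_train %>%
  mutate(House.Age = 2021 - Year.Built) %>%
  summarise(min_age = min(House.Age), median_age = median(House.Age),
            mean_age = mean(House.Age), max_age = max(House.Age))
```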


2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

ames_train_summary <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(Count = n(), sum_price = sum(price), avg_price = mean(price), med_price = median(price), min_price = min(price),
            max_price = max(price), q1_price = quantile(price, 0.25), q3_price = quantile(price, 0.75), 
            IQR_price = q3_price - q1_price, Range_price = max_price-min_price)

ames_train_summary
## # A tibble: 27 x 11
##    Neighborhood Count sum_price avg_price med_price min_price max_price q1_price
##    <fct>        <int>     <int>     <dbl>     <dbl>     <int>     <int>    <dbl>
##  1 Blmngtn         11   2184980   198635.    191000    172500    246990  182112.
##  2 Blueste          3    377400   125800     123900    116500    137000  120200 
##  3 BrDale          10    989300    98930      98750     83000    125500   88250 
##  4 BrkSide         41   5021375   122473.    124000     39300    207000   93000 
##  5 ClearCr         13   2511000   193154.    185000    107500    277000  164000 
##  6 CollgCr         85  16740823   196951.    195800    110000    475000  160000 
##  7 Crawfor         29   5921700   204197.    205000     96500    392500  154900 
##  8 Edwards         60   8179321   136322.    127250     61500    415000  112250 
##  9 Gilbert         49   9473073   193328.    183500    133000    377500  171500 
## 10 Greens           4    794250   198562.    212625    155000    214000  197375 
## # ... with 17 more rows, and 3 more variables: q3_price <dbl>, IQR_price <dbl>,
## #   Range_price <int>
ames_train %>%
  ggplot(aes(x=Neighborhood, y = price)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45))

library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
# NOTE: keep your Google Maps API key out of the document; read it from the environment
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"), write = TRUE)
ames <- get_map('ames', zoom = 13)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=ames&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=ames&key=xxx
AmesMap <- ggmap(ames, extent = 'device', legend = 'topleft')


Somerst <- data.frame(lat = 42.052650, lon = -93.644843)
BrkSide <- data.frame(lat = 42.02881903912064, lon = -93.63049188234545)
ClearCr <- data.frame(lat = 42.03657397888249, lon = -93.64887304418023)
CollgCr <- data.frame(lat = 42.025494912034155, lon = -93.63771636423816)
Crawfor <- data.frame(lat = 42.028224218062654, lon = -93.60710771543098)
Edwards <- data.frame(lat = 42.015565977136575, lon = -93.68587339952029)
Gilbert <- data.frame(lat = 42.1068163474924, lon = -93.64922121258348)
Greens <- data.frame(lat = 42.04331212046676, lon = -93.64913757887277)
GrnHill <- data.frame(lat = 42.002033485388246, lon = -93.64488427869496)
IDOTRR <- data.frame(lat = 42.022501275424275, lon = -93.62203456681684)
Blueste <- data.frame(lat = 42.00953717163152, lon = -93.64666903101156)
Blmngtn <- data.frame(lat = 42.05661593716759, lon = -93.63520641999354)
BrDale <- data.frame(lat = 42.05884455916529, lon = -93.63684168659509)
Landmrk <- data.frame(lat = 42.03024936721511, lon = -93.62022469009929)
MeadowV <- data.frame(lat = 41.99257995048931, lon = -93.60273053285842)
NAmes <- data.frame(lat = 42.045740696132356, lon = -93.62058491915803)
NoRidge <- data.frame(lat = 42.05369544009003, lon = -93.64859112044729)
NPkVill <- data.frame(lat = 42.04924453477436, lon = -93.62364254176953)
NridgHt <- data.frame(lat = 42.0602670643559, lon = -93.64938546931343)
NWAmes <- data.frame(lat = 42.05241416223634, lon = -93.63024432475562)
OldTown <- data.frame(lat = 42.029540103263216, lon = -93.61408260795548)
SWISU <- data.frame(lat = 42.022754190113034, lon = -93.64994877394756)
Sawyer <- data.frame(lat = 42.03223653964385, lon = -93.67666644205046)
SawyerW <- data.frame(lat = 42.03362757149064, lon = -93.681495277844183)
StoneBr <- data.frame(lat = 42.059842280171544, lon =  -93.6376421021748)
Timber <- data.frame(lat = 42.01824464087094, lon = -93.62028083863699)
Veenker <- data.frame(lat = 42.04176633660527, lon = -93.64888559375515)
Mitchel <- data.frame(lat = 41.991536429285624, lon = -93.60096746931343)

Locations <- bind_rows(
  list(Blmngtn = Blmngtn, Somerst = Somerst, BrkSide = BrkSide, ClearCr = ClearCr,
       CollgCr = CollgCr, Crawfor = Crawfor, Edwards = Edwards, Gilbert = Gilbert,
       Greens = Greens, GrnHill = GrnHill, IDOTRR = IDOTRR, Blueste = Blueste,
       BrDale = BrDale, Landmrk = Landmrk, MeadowV = MeadowV, NAmes = NAmes,
       NoRidge = NoRidge, NPkVill = NPkVill, NridgHt = NridgHt, NWAmes = NWAmes,
       OldTown = OldTown, SWISU = SWISU, Sawyer = Sawyer, SawyerW = SawyerW,
       StoneBr = StoneBr, Timber = Timber, Veenker = Veenker, Mitchel = Mitchel),
  .id = 'Neighborhood'
)

ames_train_summary <- ames_train_summary %>%
  left_join(Locations, by = c('Neighborhood' = 'Neighborhood'))

AmesMap +
  geom_point(aes(x = lon, y = lat, colour = avg_price, size = avg_price, stroke = 7),
             data = ames_train_summary) +
  scale_colour_gradient(low = 'yellow', high = 'red') +
  ggtitle("Average House Price by Neighborhood") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggrepel::geom_label_repel(data = ames_train_summary,
                            mapping = aes(x = lon, y = lat, label = Neighborhood))
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_label_repel).
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

AmesMap +
  geom_point(aes(x = lon, y = lat, colour = sum_price, size = sum_price, stroke = 7),
             data = ames_train_summary) +
  scale_colour_gradient(low = 'yellow', high = 'red') +
  ggtitle("Total House Price by Neighborhood") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggrepel::geom_label_repel(data = ames_train_summary,
                            mapping = aes(x = lon, y = lat, label = Neighborhood))
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_label_repel).
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

AmesMap +
  geom_point(aes(x = lon, y = lat, colour = IQR_price, size = IQR_price, stroke = 7),
             data = ames_train_summary) +
  scale_colour_gradient(low = 'yellow', high = 'red') +
  ggtitle("IQR of House Price by Neighborhood") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggrepel::geom_label_repel(data = ames_train_summary,
                            mapping = aes(x = lon, y = lat, label = Neighborhood))
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_label_repel).
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

AmesMap +
  geom_point(aes(x = lon, y = lat, colour = Range_price, size = Range_price, stroke = 7),
             data = ames_train_summary) +
  scale_colour_gradient(low = 'yellow', high = 'red') +
  ggtitle("Range of House Price by Neighborhood") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggrepel::geom_label_repel(data = ames_train_summary,
                            mapping = aes(x = lon, y = lat, label = Neighborhood))
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_label_repel).
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

* * *

As the boxplots and maps above show, the NridgHt and StoneBr neighborhoods have the highest centers and the highest maximum house prices. The BrDale, Blueste, MeadowV, Greens, and NPkVill neighborhoods have consistently low prices, with low centers, IQRs, and ranges. From the maps, the northwest and southeast corners of Ames have the highest average prices, but they also have the largest ranges; average prices decrease toward the center of the city. Because the boxplots show that price is not normally distributed in most neighborhoods, the median price is the best representation of each neighborhood's center.

To be more specific, the NridgHt and StoneBr neighborhoods have an average price of $333,646.74 and $339,316.05 respectively, and a max price of $615,000.00 and $591,587.00 respectively.

Further, the BrDale, Blueste, MeadowV, Greens, and NPkVill neighborhoods have IQRs of $16,725, $10,250, $20,150, $16,438, and $13,025, and ranges of $42,500, $20,500, $56,500, $59,000, and $26,900, respectively.

* * *
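The same neighborhoods can be pulled programmatically from the summary table built earlier; a sketch (`slice_max()`/`slice_min()` require dplyr >= 1.0):

```r
# Most expensive, least expensive, and most heterogeneous neighborhoods
ames_train_summary %>% slice_max(med_price, n = 1) %>% select(Neighborhood, med_price)
ames_train_summary %>% slice_min(med_price, n = 1) %>% select(Neighborhood, med_price)
ames_train_summary %>% slice_max(IQR_price, n = 1) %>% select(Neighborhood, IQR_price)
```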

3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

na_counts <- ames_train %>%
  summarise_all(~ sum(is.na(.))) %>%
  select_if(~ sum(.) != 0)

# order the remaining columns by descending NA count
na_counts[, order(unlist(na_counts), decreasing = TRUE)]
## # A tibble: 1 x 25
##   Pool.QC Misc.Feature Alley Fence Fireplace.Qu Lot.Frontage Garage.Yr.Blt
##     <int>        <int> <int> <int>        <int>        <int>         <int>
## 1     997          971   933   798          491          167            48
## # ... with 18 more variables: Garage.Qual <int>, Garage.Cond <int>,
## #   Garage.Type <int>, Garage.Finish <int>, Bsmt.Qual <int>, Bsmt.Cond <int>,
## #   Bsmt.Exposure <int>, BsmtFin.Type.1 <int>, BsmtFin.Type.2 <int>,
## #   Mas.Vnr.Area <int>, BsmtFin.SF.1 <int>, BsmtFin.SF.2 <int>,
## #   Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Bsmt.Full.Bath <int>,
## #   Bsmt.Half.Bath <int>, Garage.Cars <int>, Garage.Area <int>

The Pool.QC variable has the most NA values (997). This makes sense: Pool.QC is recorded as NA when a house has no pool, and the vast majority of houses in Ames do not have one.
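This explanation can be sanity-checked against Pool.Area: every house with a missing Pool.QC should report zero pool area (a sketch, assuming Pool.Area is coded as in the Ames data documentation):

```r
# Houses with missing Pool.QC should have no pool
ames_train %>%
  filter(is.na(Pool.QC)) %>%
  summarise(n_missing = n(), max_pool_area = max(Pool.Area))
```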


4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

ames_train <- ames_train %>%
  mutate(logprice = log(price))

ames_train_logprice <- ames_train %>%
  select(-c(PID, price))

ames_train_logprice_nona  <- ames_train_logprice %>%
  select(Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, logprice) 

#now, create a MLR model which includes the aforementioned variables (`Lot.Area`, `Land.Slope`, `Year.Built`, `Year.Remod.Add`, and `Bedroom.AbvGr)
  
m_ames_train_abbr <- lm(logprice ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train_logprice_nona)

BIC(m_ames_train_abbr)
## [1] 333.3642
# The BIC for the full model is 333.3642, so any model that excludes variables will have to have a BIC lower than that.

n <- nrow(ames_train_logprice_nona)
#BIC model: with k = log(n), stepAIC penalizes by BIC rather than AIC
score_step <- stepAIC(m_ames_train_abbr, k = log(n))
## Start:  AIC=-2548.51
## logprice ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        77.322 -2548.5
## - Land.Slope      2    1.4281 78.750 -2533.4
## - Bedroom.AbvGr   1    5.0628 82.385 -2486.7
## - Lot.Area        1    6.7292 84.051 -2466.7
## - Year.Remod.Add  1   11.9642 89.286 -2406.2
## - Year.Built      1   19.8546 97.177 -2321.6
score_step
## 
## Call:
## lm(formula = logprice ~ Lot.Area + Land.Slope + Year.Built + 
##     Year.Remod.Add + Bedroom.AbvGr, data = ames_train_logprice_nona)
## 
## Coefficients:
##    (Intercept)        Lot.Area   Land.SlopeMod   Land.SlopeSev      Year.Built  
##     -1.371e+01       1.028e-05       1.384e-01      -4.567e-01       6.049e-03  
## Year.Remod.Add   Bedroom.AbvGr  
##      6.778e-03       8.686e-02

Using backward elimination with BIC as the selection criterion, all five candidate variables are retained in the model. Backward elimination starts from the model containing all candidate explanatory variables and computes its BIC. At each step, it removes the single variable whose elimination most reduces the BIC, stopping when no removal lowers the BIC further. The final model therefore contains only variables whose removal would increase the BIC.
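The per-step comparison that backward elimination performs can also be reproduced directly with `drop1()`, passing `k = log(n)` so the penalty matches BIC (a sketch against the fit above):

```r
# One elimination step: information criterion after dropping each term
n <- nrow(ames_train_logprice_nona)
drop1(m_ames_train_abbr, k = log(n))
# Elimination stops when the <none> row has the smallest value,
# i.e. no single removal lowers the criterion further.
```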


To evaluate the MLR model above, we will check the conditions for multiple linear regression, namely:

Linearity and Constant Variance

m_ames_train_abbr_aug <- broom::augment(m_ames_train_abbr)

m_ames_train_abbr_aug %>%
    ggplot(aes(x=.fitted, y=.resid)) +
    geom_point(alpha = 0.6) +
    geom_hline(yintercept = 0, linetype = 'dashed') +
    labs(x= "Fitted Values", y= "Residuals")

#Normal Q-Q Plot
qqnorm(m_ames_train_abbr$residuals)
qqline(m_ames_train_abbr$residuals)

Normality: Check normality of the distribution of residuals for the MLR model

m_ames_train_abbr_aug %>%
    ggplot(aes(x=.resid)) +
    geom_histogram(binwidth = 0.25) +
    xlab('Residuals')

5

Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

ames_train_residuals <- ames_train_logprice_nona %>%
  mutate(residuals = m_ames_train_abbr$residuals, prediction = m_ames_train_abbr$fitted.values, sqresidual = residuals^2) %>%
  filter(sqresidual == max(sqresidual)) %>%
  select(logprice, residuals, prediction, sqresidual)

ames_train_maxresid <- ames_train %>%
  filter(logprice == ames_train_residuals$logprice) 

The house that has the largest squared residual is PID 902207130, whose area is 832 sqft and lot size is 9656 sqft. The relevant data point has the lowest value for logprice and, consequently, is the furthest value from the center of logprice. Because the point is an outlier regarding logprice, it makes sense that it would vary the most from the predicted value. It appears the variables that likely contribute to the abnormally low price of the house, other than the Year.Built, are the overall quality and condition of the house. However, these variables do not exist in our model, so they can’t be accounted for directly.
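The claim about quality and condition can be checked by pulling those columns for the flagged house (a sketch, assuming the standard Ames column names Overall.Qual and Overall.Cond):

```r
# Inspect the suspected drivers of the unusually low price
ames_train_maxresid %>%
  select(PID, price, area, Lot.Area, Year.Built, Overall.Qual, Overall.Cond)
```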


6

Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

ames_train_abbr_logarea <- ames_train_logprice_nona %>%
  mutate(logarea = log(Lot.Area)) %>%
  select(-Lot.Area)
  
m_ames_train_loglot<- lm(logprice ~ ., data = ames_train_abbr_logarea)

BIC(m_ames_train_loglot)
## [1] 229.6411
# The BIC for the full model is 229.6411, so any model that excludes variables will have to have a BIC lower than that.

n <- nrow(ames_train_abbr_logarea)
#BIC model: with k = log(n), stepAIC penalizes by BIC rather than AIC
score_step_loglot <- stepAIC(m_ames_train_loglot, k = log(n))
## Start:  AIC=-2663.5
## logprice ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr + 
##     logarea
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        69.704 -2663.5
## - Land.Slope      2    0.4429 70.147 -2657.2
## - Bedroom.AbvGr   1    2.2002 71.904 -2632.4
## - Year.Remod.Add  1   11.8182 81.522 -2506.9
## - logarea         1   14.3474 84.051 -2476.3
## - Year.Built      1   19.4083 89.112 -2417.9
score_step_loglot
## 
## Call:
## lm(formula = logprice ~ Land.Slope + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr + logarea, data = ames_train_abbr_logarea)
## 
## Coefficients:
##    (Intercept)   Land.SlopeMod   Land.SlopeSev      Year.Built  Year.Remod.Add  
##     -15.530123        0.115076       -0.065541        0.005981        0.006734  
##  Bedroom.AbvGr         logarea  
##       0.059089        0.244239

After replacing Lot.Area with log(Lot.Area), the model that minimizes BIC does contain the same predictors, except for replacing Lot.Area with log(Lot.Area).
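As a quick check that the two selections retain the same predictors (up to the transformed lot-size term), the coefficient names of the two fits can be compared directly:

```r
# Terms kept by each stepwise selection
names(coef(score_step))
names(coef(score_step_loglot))
# Per the coefficient tables above, the sets differ only in Lot.Area vs. logarea
```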


7

Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.

# First, we will illustrate the linearity of Lot.Area, evaluating a scatter plot of the residuals, a histogram showing the distribution of the residuals, and a QQ-Plot of the residuals to again evaluate normality
g_lotarea_scatter <- m_ames_train_abbr %>%
  ggplot(aes(x=.fitted, y=.resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(x= "Fitted Values", y= "Residuals") + 
  ggtitle("Scatter Plot of Residuals for Lot.Area")
g_lotarea_histogram <- m_ames_train_abbr %>%
  ggplot(aes(x=.resid)) + 
  geom_histogram()+
  xlab("Residuals") +
  ggtitle("Distribution of Residuals for Lot.Area")
g_lotarea_qqplot <- m_ames_train_abbr %>%
  ggplot(aes(sample = .resid)) +
  geom_qq_line() + 
  geom_qq()
g_lotarea_scatter

g_lotarea_histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

g_lotarea_qqplot

# Now, we will repeat the process, but to evaluate the linearity of log(Lot.Area)
g_loglotarea_scatter <- m_ames_train_loglot %>%
  ggplot(aes(x=.fitted, y=.resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(x= "Fitted Values", y= "Residuals") + 
  ggtitle("Scatter Plot of Residuals for log(Lot.Area)")
g_loglotarea_histogram <- m_ames_train_loglot %>%
  ggplot(aes(x=.resid)) + 
  geom_histogram()+
  xlab("Residuals") +
  ggtitle("Distribution of Residuals for log(Lot.Area)")
g_loglotarea_qqplot <- m_ames_train_loglot %>%
  ggplot(aes(sample = .resid)) +
  geom_qq_line() + 
  geom_qq()
g_loglotarea_scatter

g_loglotarea_histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

g_loglotarea_qqplot
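The question also asks for plots of predicted versus true log home price. A sketch using the two selected fits (broom::augment, as used earlier, attaches the fitted values):

```r
# Predicted vs. true log(price) for the Lot.Area and log(Lot.Area) models
g_pred_lotarea <- broom::augment(score_step) %>%
  ggplot(aes(x = logprice, y = .fitted)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(x = 'True log(price)', y = 'Predicted log(price)') +
  ggtitle('Lot.Area Model')
g_pred_loglotarea <- broom::augment(score_step_loglot) %>%
  ggplot(aes(x = logprice, y = .fitted)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(x = 'True log(price)', y = 'Predicted log(price)') +
  ggtitle('log(Lot.Area) Model')
g_pred_lotarea
g_pred_loglotarea
```

Points closer to the dashed identity line indicate better predictions.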


After comparing the residual diagnostics of the models that include Lot.Area and log(Lot.Area), it is apparent that Lot.Area should be replaced with its log transform in the MLR model. The scatter plot of residuals for the Lot.Area model is heavily concentrated around fitted values near 12, with a couple of notable outliers between 13.0 and 13.5. The residuals for the log(Lot.Area) model are scattered much more evenly around 0, without the significant outliers seen for Lot.Area.

Regarding the distribution of residuals, the distribution for Lot.Area is skewed left with a notable outlier around -2. Disregarding the outlier, the distribution is roughly normal. After log transforming the aforementioned variable, the residual plot looks very similar. The outlier around -2 still exists, and the distribution is still roughly normal, but the frequency of residuals at 0 is greater for the model with log(Lot.Area).

Finally, when evaluating the normality of the residuals using a QQ-plot, you can see that the residual values for log(Lot.Area) are closer to the line for values greater than 0; however, for negative residuals, the values are about the same distance from the line as with Lot.Area. Therefore, we can say that the distribution for log(Lot.Area) is slightly more normal than Lot.Area, though not significantly.
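To put a number on the comparison, the two selected models can be set side by side on BIC and adjusted R-squared (a sketch using the fits above); the lower BIC reported earlier for the log(Lot.Area) model (229.64 vs. 333.36) is quantitative support for the transformation:

```r
# BIC and adjusted R-squared for the two selected models
data.frame(
  model = c('Lot.Area', 'log(Lot.Area)'),
  BIC = c(BIC(score_step), BIC(score_step_loglot)),
  adj_r_squared = c(summary(score_step)$adj.r.squared,
                    summary(score_step_loglot)$adj.r.squared)
)
```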


7.0.1